    On the Enhancement of Remote GPU Virtualization in High Performance Clusters

    Graphics Processing Units (GPUs) are being adopted in many computing facilities given their extraordinary computing power, which makes it possible to accelerate many general-purpose applications from different domains. However, GPUs also present several side effects, such as increased acquisition costs and larger space requirements. They also require more powerful energy supplies. Furthermore, GPUs still consume some energy while idle, and their utilization is usually low for most workloads. In a similar way to virtual machines, the use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism allows an application running in one node of the cluster to transparently use the GPUs installed in other nodes. Moreover, this technique allows the GPUs present in the computing facility to be shared among the applications running in the cluster. In this way, several applications running in different (or the same) cluster nodes can share one or more GPUs located in other nodes of the cluster. Sharing GPUs should increase overall GPU utilization, thus reducing the negative impact of the side effects mentioned before. Reducing the total number of GPUs installed in the cluster may also be possible. In this dissertation we enhance one framework offering remote GPU virtualization capabilities, referred to as rCUDA, for its use in high-performance clusters. While the initial prototype version of rCUDA demonstrated its functionality, it also revealed concerns with respect to usability, performance, and support for new GPU features, which prevented its use in production environments. These issues motivated this thesis, in which all the research is primarily conducted with the aim of turning rCUDA into a production-ready solution for eventually transferring it to industry. 
The new version of rCUDA resulting from this work reduces the execution time of the applications analyzed by up to 35% with respect to the initial version. Compared to the use of local GPUs, the overhead of this new version of rCUDA is below 5% for the applications studied when using the latest high-performance computing networks available.
Reaño González, C. (2017). On the Enhancement of Remote GPU Virtualization in High Performance Clusters [Doctoral thesis]. Universitat Politècnica de València.
https://doi.org/10.4995/Thesis/10251/86219

    A performance comparison of CUDA remote GPU virtualization frameworks

    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Using GPUs reduces the execution time of many applications but increases acquisition cost and power consumption. Furthermore, GPUs usually attain a relatively low utilization. In this context, remote GPU virtualization solutions were recently created to overcome the drawbacks of using GPUs. Currently, many different remote GPU virtualization frameworks exist, all of them presenting very different characteristics. These differences may lead to differences in performance. In this work we present a performance comparison among the only three CUDA remote GPU virtualization frameworks publicly available at no cost. Results show that performance greatly depends on the exact framework used, with the rCUDA virtualization solution standing out among them. Furthermore, rCUDA doubles performance over CUDA for pageable memory copies.
This work was funded by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. Authors are also grateful for the generous support provided by Mellanox Technologies.
Reaño González, C.; Silla Jiménez, F. (2015). A performance comparison of CUDA remote GPU virtualization frameworks. IEEE. https://doi.org/10.1109/CLUSTER.2015.76

    Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality

    Graphics Processing Units (GPUs) have become widely used to accelerate scientific applications; therefore, it is important that Computer Science and Computer Engineering curricula include the fundamentals of parallel computing with GPUs. Regarding the practical part of the training, one important concern is how to introduce GPUs into a laboratory: installing GPUs in all the computers of the lab may not be affordable, while sharing a remote GPU server among several students may result in a poor learning experience because of its associated overhead. In this paper we propose a solution to address this problem: the use of the rCUDA (remote CUDA) middleware, which enables programs running on one computer to make concurrent use of GPUs located in remote servers. Hence, students would be able to concurrently and transparently share a single remote GPU from their local machines in the laboratory without having to log into the remote server. To demonstrate that our proposal is feasible, we present results from a real scenario. The results show that the cost of the laboratory is noticeably reduced while the quality of the learning experience is maintained.
Reaño González, C.; Silla Jiménez, F. (2015). Reducing the Costs of Teaching CUDA in Laboratories while Maintaining the Learning Experience Quality. In INTED2015 Proceedings. IATED. 3651-3660. http://hdl.handle.net/10251/70229

    InfiniBand verbs optimizations for remote GPU virtualization

    The use of InfiniBand networks to interconnect high performance computing clusters has increased considerably in recent years, so much so that the majority of the supercomputers included in the TOP500 list use either Ethernet or InfiniBand interconnects. Regarding the latter, due to the complexity of the InfiniBand programming API (i.e., InfiniBand Verbs) and the lack of documentation, there are not enough recent studies explaining how to optimize applications to get the maximum performance from this fabric. In this paper we present two different optimizations to be used when developing applications using InfiniBand Verbs, providing an average bandwidth improvement of 3.68% and 217.14%, respectively. In addition, we show that when combining both optimizations, the average bandwidth gain is 43.29%. This bandwidth increase is key for remote GPU virtualization frameworks: this noticeable gain translates into a reduction of up to 35% in the execution time of applications using remote GPU virtualization frameworks.
This work was funded by the Generalitat Valenciana under Grant PROMETEOII/2013/009 of the PROMETEO program phase II. Authors are also grateful for the generous support provided by Mellanox Technologies.
Reaño González, C.; Silla Jiménez, F. (2015). InfiniBand verbs optimizations for remote GPU virtualization. IEEE. https://doi.org/10.1109/CLUSTER.2015.139
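As a rough illustration of why a bandwidth gain matters for remote GPU virtualization, the sketch below converts a percentage bandwidth improvement into the corresponding reduction in pure data-transfer time, assuming transfer time is simply size divided by bandwidth. The function name `transfer_time_reduction` is hypothetical and is not taken from the paper; the model ignores latency and computation, so it is not expected to reproduce the end-to-end figure of up to 35% reported in the abstract.

```python
def transfer_time_reduction(bandwidth_gain_pct: float) -> float:
    """Return the percentage reduction in transfer time for a bandwidth gain.

    Assumption (not from the paper): time = size / bandwidth, so a gain of g%
    multiplies bandwidth by (1 + g/100) and divides transfer time by the same factor.
    """
    if bandwidth_gain_pct <= -100.0:
        raise ValueError("bandwidth gain must be greater than -100%")
    speedup = 1.0 + bandwidth_gain_pct / 100.0
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    # 43.29% is the combined average bandwidth gain quoted in the abstract.
    gain = 43.29
    reduction = transfer_time_reduction(gain)
    print(f"{gain:.2f}% more bandwidth -> {reduction:.2f}% less transfer time")
```

Under this simplified model, the combined 43.29% bandwidth gain corresponds to roughly 30% less time spent moving data, which gives a sense of scale for transfer-bound applications.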

    CU2rCU: A CUDA-to-rCUDA Converter

    [EN] GPUs (Graphics Processor Units) are being increasingly embraced by the high performance computing and computational communities as an effective way of considerably reducing application execution time by accelerating significant parts of their codes. CUDA (Compute Unified Device Architecture) is a new technology developed by NVIDIA which leverages the parallel compute engine in GPUs.
However, the use of GPUs in current HPC clusters presents certain negative side effects, mainly related to acquisition costs and power consumption. rCUDA (remote CUDA) was recently developed as a software solution to address these concerns. Specifically, it is a middleware that allows a reduced number of CUDA-compatible GPUs to be transparently shared among the nodes in a cluster, reducing acquisition costs and power consumption. While the initial prototype versions of rCUDA demonstrated its functionality, they also revealed several concerns related to usability and performance. With respect to usability, the rCUDA framework was limited by its lack of support for the CUDA extensions to the C language. Thus, it was necessary to manually convert the original CUDA source code into functionally identical plain C code that does not include such extensions. For this purpose, in this document we present a new component of the rCUDA suite that automatically transforms any CUDA source code into plain C code, so that it can be effectively accommodated within the rCUDA technology.
Reaño González, C. (2012). CU2rCU: A CUDA-to-rCUDA Converter. http://hdl.handle.net/10251/27435

    Accelerator Virtualization in Fog Computing: Moving from the Cloud to the Edge

    [EN] Hardware accelerators are available on the cloud for enhanced analytics. Next-generation clouds aim to bring enhanced analytics using accelerators closer to user devices at the edge of the network for improving quality of service (QoS) by minimizing end-to-end latencies and response times. The collective computing model that utilizes resources at the cloud-edge continuum in a multi-tier hierarchy comprising the cloud, edge, and user devices is referred to as fog computing. This article identifies challenges and opportunities in making accelerators accessible at the edge. A holistic view of the fog architecture is key to pursuing meaningful research in this area.
Varghese, B.; Reaño González, C.; Silla Jiménez, F. (2018). Accelerator Virtualization in Fog Computing: Moving from the Cloud to the Edge. IEEE Cloud Computing. 5(6):28-37. https://doi.org/10.1109/MCC.2018.064181118

    On the Effect of using rCUDA to Provide CUDA Acceleration to Xen Virtual Machines

    [EN] Nowadays, many data centers use virtual machines (VMs) in order to achieve a more efficient use of hardware resources. The use of VMs reduces equipment and maintenance expenses as well as electricity consumption. Nevertheless, current virtualization solutions, such as Xen, do not easily provide graphics processing units (GPUs) to applications running in the virtualized domain with the flexibility usually required in data centers (i.e., managing virtual GPU instances and concurrently sharing them among several VMs). Therefore, the execution of GPU-accelerated applications within VMs is hindered by this lack of flexibility. In this regard, remote GPU virtualization solutions may address this concern. In this paper we analyze the use of the remote GPU virtualization mechanism to accelerate scientific applications running inside Xen VMs. We conduct our study with six different applications: CUDA-MEME, CUDASW++, GPU-BLAST, LAMMPS, a triangle-counting application referred to as TRICO, and a synthetic benchmark used to emulate different application behaviors. Our experiments show that the use of remote GPU virtualization is a feasible approach to address the current concerns of sharing GPUs among several VMs, featuring a very low overhead if an InfiniBand fabric is already present in the cluster.
This work was funded by the Generalitat Valenciana under Grant PROMETEO/2017/077. Authors are also grateful for the generous support provided by Mellanox Technologies Inc.
Prades, J.; Reaño González, C.; Silla Jiménez, F. (2019). On the Effect of using rCUDA to Provide CUDA Acceleration to Xen Virtual Machines.

    Improving the management efficiency of GPU workloads in data centers through GPU virtualization

    [EN] Graphics processing units (GPUs) are currently used in data centers to reduce the execution time of compute-intensive applications. However, the use of GPUs presents several side effects, such as increased acquisition costs and larger space requirements. Furthermore, GPUs require a nonnegligible amount of energy even while idle. Additionally, GPU utilization is usually low for most applications. In a similar way to the use of virtual machines, using virtual GPUs may address the concerns associated with the use of these devices. In this regard, the remote GPU virtualization mechanism could be leveraged to share the GPUs present in the computing facility among the nodes of the cluster. This would increase overall GPU utilization, thus reducing the negative impact of the increased costs mentioned before. Reducing the number of GPUs installed in the cluster could also be possible. However, in the same way as job schedulers map GPU resources to applications, virtual GPUs should also be scheduled before job execution, and current job schedulers are not able to deal with virtual GPUs. In this paper, we analyze the performance attained by a cluster using the remote Compute Unified Device Architecture (rCUDA) middleware and a modified version of the Slurm scheduler, which is now able to assign remote GPUs to jobs. Results show that cluster throughput, measured as jobs completed per time unit, is doubled while total energy consumption is reduced by up to 40%. GPU utilization is also increased.
Generalitat Valenciana, Grant/Award Number: PROMETEO/2017/077; MINECO and FEDER, Grant/Award Number: TIN2014-53495-R, TIN2015-65316-P and TIN2017-82972-R.
Iserte, S.; Prades, J.; Reaño González, C.; Silla, F. (2021). Improving the management efficiency of GPU workloads in data centers through GPU virtualization. Concurrency and Computation: Practice and Experience. 33(2):1-16. https://doi.org/10.1002/cpe.5275
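The scheduling idea in the abstract above, assigning remote GPUs to jobs rather than only node-local ones, can be illustrated with a toy placement routine. This is a minimal sketch under stated assumptions and not the modified Slurm implementation: the function `place_job`, the `free` dictionary of free GPUs per node, and the "local first, then other nodes in name order" rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def place_job(free: Dict[str, int], node: str, gpus: int,
              allow_remote: bool) -> Optional[List[Tuple[str, int]]]:
    """Try to reserve `gpus` GPUs for a job that runs on `node`.

    `free` maps node names to their free GPU counts and is updated in place.
    Returns a list of (gpu_node, count) pairs, or None if the job must wait.
    Hypothetical policy: use local GPUs first, then remote nodes in name order.
    """
    candidates = [node]
    if allow_remote:
        candidates += [n for n in sorted(free) if n != node]
    if sum(free.get(n, 0) for n in candidates) < gpus:
        return None  # not enough free GPUs reachable by this job
    allocation = []
    remaining = gpus
    for n in candidates:
        take = min(free.get(n, 0), remaining)
        if take > 0:
            free[n] -= take
            allocation.append((n, take))
            remaining -= take
        if remaining == 0:
            break
    return allocation


if __name__ == "__main__":
    cluster = {"n1": 2, "n2": 0}
    # A job on the GPU-less node n2 cannot start with local-only assignment...
    print(place_job(dict(cluster), "n2", 1, allow_remote=False))
    # ...but can borrow a GPU from n1 when remote GPUs are allowed.
    print(place_job(dict(cluster), "n2", 1, allow_remote=True))
```

The point of the sketch is only that decoupling jobs from the nodes that physically host GPUs lets otherwise idle devices be used, which is consistent with the throughput and utilization gains the abstract reports.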

    On the Deployment and Characterization of CUDA Teaching Laboratories

    When teaching CUDA in laboratories, an important issue is the economic cost of GPUs, which may prevent some universities from building large enough labs to teach CUDA. In this paper we propose an efficient solution to build CUDA labs with a reduced number of GPUs. It is based on the use of the rCUDA (remote CUDA) middleware, which enables programs running on one computer to concurrently use GPUs located in remote servers. To study the viability of our proposal, we first characterize the use of GPUs in this kind of lab with statistics taken from real users, and then present results of sharing GPUs in a real teaching lab. The experiments validate the feasibility of our proposal, showing an overhead under 5% with respect to having a GPU in each of the students’ computers. These results clearly improve on alternative approaches, such as logging into remote GPU servers, which presents an overhead of about 30%.
This work was partially funded by Escola Tècnica Superior d’Enginyeria Informàtica de la Universitat Politècnica de València and by Departament d'Informàtica de Sistemes i Computadors de la Universitat Politècnica de València.
Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories. In EDULEARN15 Proceedings. IATED. http://hdl.handle.net/10251/70225

    Intra-node Memory Safe GPU Co-Scheduling

    [EN] GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper aims to improve the utilisation of GPUs by proposing a framework, which we refer to as schedGPU, to facilitate intra-node GPU co-scheduling such that a GPU can be safely shared among multiple applications by taking memory constraints into account. Two approaches, namely a client-server approach and a shared memory approach, are explored; the shared memory approach is more suitable due to its lower overheads. Four policies are proposed in schedGPU to handle applications that are waiting to access the GPU, two of which account for priorities. The feasibility of schedGPU is validated on three real-world applications. The key observation is that a performance gain is achieved. For single applications, a gain of over 10 times, as measured by GPU utilisation and GPU memory utilisation, is obtained. For workloads comprising multiple applications, a speed-up of up to 5x in the total execution time is noted. Moreover, the average GPU utilisation and average GPU memory utilisation are increased by 5 and 12 times, respectively.
This work was funded by Generalitat Valenciana under grant PROMETEO/2017/77.
Reaño González, C.; Silla Jiménez, F.; Nikolopoulos, DS.; Varghese, B. (2018). Intra-node Memory Safe GPU Co-Scheduling. IEEE Transactions on Parallel and Distributed Systems. 29(5):1089-1102. https://doi.org/10.1109/TPDS.2017.2784428
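The memory-safe co-scheduling idea described above can be sketched as an admission gate: an application starts on the shared GPU only if its declared memory requirement fits in the memory that is still free, and otherwise waits. The code below is a hypothetical illustration, not schedGPU itself. The class `MemorySafeGate`, its method names, and the strict first-come-first-served waiting rule are assumptions; the abstract mentions four waiting policies but does not specify them.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class MemorySafeGate:
    """Toy admission control for sharing one GPU under a memory budget (in MiB)."""

    def __init__(self, total_mib: int):
        self.total_mib = total_mib
        self.free_mib = total_mib
        self.running: Dict[str, int] = {}            # app -> reserved MiB
        self.waiting: Deque[Tuple[str, int]] = deque()  # (app, MiB) in arrival order

    def request(self, app: str, mib: int) -> bool:
        """Return True if `app` may start now; otherwise queue it and return False."""
        if mib > self.total_mib:
            raise ValueError("request exceeds total GPU memory")
        # Assumed policy: strict FIFO, so nobody overtakes an earlier waiter.
        if not self.waiting and mib <= self.free_mib:
            self._start(app, mib)
            return True
        self.waiting.append((app, mib))
        return False

    def release(self, app: str) -> List[str]:
        """Free the memory of `app` and start queued apps that now fit, in order."""
        self.free_mib += self.running.pop(app)
        started = []
        while self.waiting and self.waiting[0][1] <= self.free_mib:
            nxt, mib = self.waiting.popleft()
            self._start(nxt, mib)
            started.append(nxt)
        return started

    def _start(self, app: str, mib: int) -> None:
        self.free_mib -= mib
        self.running[app] = mib


if __name__ == "__main__":
    gate = MemorySafeGate(total_mib=1000)
    print(gate.request("a", 600))  # fits, starts immediately
    print(gate.request("b", 600))  # would exceed free memory, so it waits
    print(gate.release("a"))       # freeing "a" lets "b" start
```

Because the sum of reserved memory never exceeds the budget, co-located applications in this sketch cannot fail from running out of GPU memory, which is the safety property the abstract emphasises.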